Introduction

This analysis seeks to build a predictive model for housing prices in San Francisco using local data, with the goal of developing a more specialized understanding of the city's housing market. A model attuned to the characteristics of a locale should provide a more accurate assessment of home prices. Accuracy is a concern not only for realtors and home buyers, but also for those invested in public policy: home prices have broad implications for municipal tax revenues and for the shape of redevelopment in general. The immediate focus of this endeavor, however, is to find a predictive model for the city that could improve on Zillow's.

Although analysts possess a wealth of data on San Francisco, numerous obstacles stand in the way of an easy predictive model. Low-rise and largely constructed between 1905 and 1940, the city's housing stock is relatively uniform, making it difficult to tease out price distinctions from structural qualities. Similarly, a civic emphasis on historic preservation, local rent controls, and the most expensive housing market in the country may exert unseen influence on prices, muddying the analysis environment.

Our approach to modeling was to isolate variables that reflect high levels of affluence, as ownership and neighborhood demographics appeared to be more fertile ground for analysis. Although characteristics such as the number of bedrooms may correlate with price in certain neighborhoods, across the city as a whole we could not find a significant relationship between price and many of the variables that conventional wisdom would suggest are useful.

An exploratory analysis of several environmental factors, including elevation, distance from key modes of transportation, and the number of nearby trees, produced no significant results. Data from the American Community Survey's five-year estimates and distances from certain key thoroughfares were the only sources that returned viable variables.

## Summary Statistics
## ========================================================================================
## Statistic              N       Mean      St. Dev.    Min   Pctl(25)  Pctl(75)     Max   
## ----------------------------------------------------------------------------------------
## Sale Price           5,250 1,215,628.00 745,984.70 100,001  715,003  1,486,503 4,695,001
## Mrtg 3k              5,250    52.90       17.48     0.00     39.50     64.50     94.70  
## Pct White            5,250    41.64       22.35     0.00     22.20     58.20     83.40  
## HS Grad              5,250    12.95        7.62     0.00     6.60      18.40     37.00  
## Beds                 5,250     3.04        1.05       1        2         4        10    
## Property Area        5,250   1,792.99     725.68     261     1,275    2,160.8    6,726  
## Grad Degree          5,250    22.32       12.24     0.00     13.50     31.70     46.90  
## Income Over 75k      5,250    26.95       14.05       0       16        39        60    
## Neighborhood Premium 5,250     1.03        0.41     0.00     0.64      1.21      2.97   
## Open Space Dist      5,250    710.08      491.51    9.53    330.12   1,000.62  2,738.74 
## Vacant Land Dist     5,250    378.48      307.53    0.00    157.58    509.96   2,087.49 
## Bikeway Dist         5,250    622.83      493.40    32.42   215.04    926.70   2,653.52 
## Market St Dist       5,250   9,697.63    5,035.58   60.64  5,898.41  13,773.37 21,968.20
## Embarcadero Dist     5,250  21,357.57    7,392.79  326.67  16,494.60 27,289.85 38,004.50
## ----------------------------------------------------------------------------------------

Data

The primary dataset provided with the assignment supplied several of the variables used in the predictive model. Other datasets were selected from the OpenDataSF platform and the 2015 American Community Survey (5-year estimates). From OpenDataSF, the model relies on datasets containing the locations of officially recognized neighborhoods, open space parcels, vacant parcels, bikeways, and streets – chiefly two of San Francisco's main thoroughfares, Market Street and The Embarcadero.

Specific community characteristics collected from ACS data at the tract level – the share of homeowners with a mortgage who spend $3,000 or more on their housing costs, the share of the population that self-identifies as white, the share with at least a high school diploma, the share with a graduate or post-professional degree, and the share of individuals with incomes over $75,000 – were all considered when building the predictive model.

Scatterplots of Selected Variables

A Particularly Useful Variable

The Neighborhood Premium represents the market preference for homes in specific neighborhoods. This value was calculated by taking the mean of all home sales in each neighborhood in the given data and dividing that neighborhood mean by the mean sale price of all homes in San Francisco. A neighborhood premium greater than 1 means that buyers are willing to pay a premium to live in that neighborhood, while a premium less than 1 signifies a less desirable neighborhood where home prices are expected to be lower relative to all other homes for sale in San Francisco. For example, homes in the Seacliff neighborhood were, on average, 2.2 times as expensive as homes throughout San Francisco, demonstrating that buyers are willing to pay a premium to live there.
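The calculation is straightforward; the Python sketch below illustrates it with a tiny made-up sales table (the neighborhood names and prices are invented, not the actual data, and the original analysis was not necessarily done in Python).

```python
from statistics import mean

# Hypothetical sales records (neighborhood, sale price) -- illustrative only
sales = [
    ("Seacliff", 2_600_000), ("Seacliff", 2_800_000),
    ("Excelsior",  700_000), ("Excelsior",  750_000),
]

# Citywide mean sale price across all transactions
citywide_mean = mean(price for _, price in sales)

# Neighborhood premium = neighborhood mean / citywide mean
premium = {}
for hood in {h for h, _ in sales}:
    hood_mean = mean(p for h, p in sales if h == hood)
    premium[hood] = hood_mean / citywide_mean
```

A premium above 1 (here, Seacliff) marks a neighborhood where buyers pay above the citywide average; below 1 (Excelsior) marks the opposite.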

Methods

After reviewing the candidate home sale price indicators for correlations, we settled on a linear model that uses 8 variables to predict home price:

            Neighborhood Premium
            Share of homeowners (With a mortgage) with housing costs >$3,000 per census tract
            Share of white population per census tract
            Share of population with at least a high school diploma per census tract
            Share of population with a graduate degree or higher per census tract
            Share of population with an income over $75,000 per census tract
            Number of bedrooms
            Property Area
            

Our model was evaluated through cross-validation, which repeatedly withholds a subset of homes whose sale prices are already known, predicts their prices with a model trained on the remainder, and compares the predictions against the actual prices to gauge accuracy. Since more than 5,000 homes had all the data required for the model, this method produces a reliable result. Because the model relies on multiple variables, if a home has missing data (e.g., the number of bedrooms), the remaining variables can still be used to estimate its price. Below are the results from the model's training set.
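As a sketch of the procedure, the Python snippet below runs 5-fold cross-validation of an ordinary least squares fit on synthetic data standing in for the eight predictors; the coefficients and noise scale are invented for illustration and do not reproduce the actual results.

```python
import numpy as np

rng = np.random.default_rng(0)
n_homes, n_vars = 500, 8

# Synthetic stand-ins for the eight predictors and sale prices
X = rng.normal(size=(n_homes, n_vars))
true_coef = rng.normal(scale=50_000, size=n_vars)
y = 1_200_000 + X @ true_coef + rng.normal(scale=300_000, size=n_homes)

# 5-fold cross-validation: withhold one fold, fit OLS on the rest, predict
folds = np.array_split(rng.permutation(n_homes), 5)
abs_errors = []
for k in range(5):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
    abs_errors.append(np.abs(A_test @ coef - y[test_idx]))

mae = float(np.mean(np.concatenate(abs_errors)))
print(f"cross-validated MAE: {mae:,.0f}")
```

Averaging the held-out absolute errors across all five folds gives the cross-validated MAE reported below.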

## Reading layer `ALL_WithVars' from data source `C:\Users\bndodson\Documents\MUSA507\Midterm_SFO\ALL_WithVars\ALL_WithVars.shp' using driver `ESRI Shapefile'
## Simple feature collection with 10127 features and 38 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 5980569 ymin: 2086052 xmax: 6020464 ymax: 2121818
## epsg (SRID):    NA
## proj4string:    +proj=lcc +lat_1=37.06666666666667 +lat_2=38.43333333333333 +lat_0=36.5 +lon_0=-120.5 +x_0=2000000 +y_0=500000.0000000001 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs

Coefficient estimates:

    Intercept             -660,831.4
    Neighborhood Premium   672,077.8
    Mrtg 3k                   -441.8
    Pct White                6,913.7
    HS Grad                  5,865.0
    Grad Degree              2,157.0
    Income Over 75k            233.2
    Beds                   -57,684.4
    Property Area              540.3

Model fit: R2 = 0.63, MAE = 333,201.5, MAPE = 0.331
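For reference, MAE and MAPE are computed directly from predicted and actual prices; the short Python sketch below uses made-up numbers for three hypothetical homes.

```python
import numpy as np

# Hypothetical actual and predicted sale prices for three homes
actual    = np.array([1_000_000,   800_000, 1_200_000])
predicted = np.array([1_150_000,   700_000, 1_100_000])

abs_err = np.abs(predicted - actual)
mae  = abs_err.mean()             # mean absolute error, in dollars
mape = (abs_err / actual).mean()  # mean absolute percentage error, as a fraction
```

MAE reports the typical dollar error, while MAPE scales each error by the home's actual price, so cheap and expensive homes contribute comparably.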

Histogram of MAE

Based on the cross-validation test used to build the model, the 8 variables combined can predict the sale price of a San Francisco home to within roughly $330,000 on average (the mean absolute error). Although the model outperforms a naive baseline, it still had difficulty accounting for variation in price.

A Moran's I test shows that many of the errors in our model exhibit a moderate level of spatial autocorrelation; in other words, our errors cluster together by location.
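Global Moran's I can be computed directly from a spatial weights matrix; the toy Python sketch below (with an invented four-location adjacency) shows how clustered values yield a positive statistic and dispersed values a negative one.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I: (n / W) * (z' W z) / (z' z), z = deviations from the mean."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()
    return len(x) * (z @ w @ z) / (w.sum() * (z @ z))

# Toy example: four locations in a row, binary adjacency weights (hypothetical)
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

clustered = np.array([10.0, 9.0, 1.0, 2.0])  # similar values sit next to each other
dispersed = np.array([10.0, 1.0, 9.0, 2.0])  # high and low values alternate

print(morans_i(clustered, w))  # positive -> values cluster in space
print(morans_i(dispersed, w))  # negative -> values alternate in space
```

Applied to the model's residuals, a significantly positive Moran's I is exactly the clustered-errors pattern described above.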

In the following map, which shows our model's MAPE by neighborhood, we can see that our model's errors were particularly large in the north and southeast sections of the city. The former features relatively high prices and the latter relatively low prices. As mentioned in the introduction, we trained our model on variables that reflect affluence and proximity to a couple of main thoroughfares. Although these variables exhibited relatively high correlations with sale prices, they naturally missed the mark on prices farther from the center and in less affluent neighborhoods. Accounting for the prices of neighboring houses (a spatial lag) would have mitigated some of these problems.
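A spatial lag variable is simply each home's neighbor-weighted average price; a minimal Python sketch with a hypothetical four-home adjacency matrix:

```python
import numpy as np

# Hypothetical adjacency: homes 0 and 3 each neighbor homes 1 and 2, and vice versa
w = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# Row-standardize so each home's lag is the mean sale price of its neighbors
w_row = w / w.sum(axis=1, keepdims=True)

prices = np.array([1_500_000, 1_400_000, 800_000, 750_000])
spatial_lag = w_row @ prices  # candidate predictor: neighbors' average price
```

Adding this lag as a ninth predictor would let the model borrow information from nearby sales, directly targeting the clustered errors seen on the map.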


Conclusion

The home sale price prediction model has been tuned to the unique characteristics of San Francisco. Zillow, the market leader in home sale price prediction, needs a completely different approach because its service predicts home prices throughout the United States; it requires a model robust enough to generalize across localities, which ours is not. Zillow's model has proven itself a robust predictor partly because it draws on rich sales data from the MLS, but that data is fairly open. Zillow likely succeeds because it also holds data on the users searching for homes on its platform. It knows which filters buyers apply to their searches: if, for example, 80% of users refined their search by a specific number of bedrooms, Zillow could learn to weight the number of bedrooms more heavily than, say, the presence of a fireplace. Its native neighborhood search, which limits results to a neighborhood's boundaries, likewise reveals users' preferences for specific neighborhoods. Our model lacks this user data; it is built only on data for homes that have actually sold. On the other hand, many Zillow users are not seriously shopping for a home, which can skew its user data, and there is no guarantee that user behavior is a better indicator of sale price than MLS records.

The City of San Francisco could build a tax value prediction model like Zillow's with one additional important factor. If the city wants to strike a balance between unsustainable gentrification and maximizing property tax revenue, it could use home sale data to estimate what year-over-year growth rate in assessed value leads residents to sell. The longer residents stay in their homes, the less likely gentrification is to push out longtime residents. If, for example, the city found that when a property's tax value increases by 6% or more, residents are 40% more likely to sell within a year, it could cap tax value increases at 6%. Fewer residents leaving their neighborhoods would strengthen resident ownership and slow gentrification. Understandably, this approach would create new problems, including a reduced supply of homes for sale in San Francisco and, in turn, higher sale prices for those that remain on the market. Still, it could be a start, and it exemplifies how the City of San Francisco could use an economic prediction model to address a social problem.

The model we created could also be used to build an index of homes whose attributes make them more desirable than others; a neighborhood with a high share of such homes could be flagged as susceptible to gentrification. Combining this gentrification index with Zillow's data would make for a strong model, but since Zillow's user data is not open, municipalities will still have to rely on their own. In the future, better predictive models could leverage anonymized user data alongside home statistics to create a more equitable property tax revenue stream. Our model is an example of what that future could look like.
